ABSTRACT

The 54 papers related to test development,
interpretation, and use that were presented at the 1972 AERA
Conference are reviewed. The papers were classified into 11
categories, as follows: A. What to measure: educational objectives,
attitude measurement, and creativity; B. How to measure: item types,
test development, response modifications, confidence weighting,
semantic differential, and observational techniques; and C. Test
use: testing programs and test bias. A listing of the papers
reviewed, their authors, and, when applicable, the ED numbers
concludes the summary. (DB)
INTRODUCTION



About 700 of the 1,000 papers presented at the 1972 AERA Annual Meeting in Chicago, 
Illinois were collected by the ERIC Clearinghouse on Tests, Measurement, and Evaluation 
(ERIC/TM). ERIC/TM indexed and abstracted for announcement in Research in 
Education (RIE) 200 papers which fell within our area of interest: testing, measurement,
and evaluation. The remaining papers were distributed to the other Clearinghouses in the 
ERIC system for processing. 

Because of an interest in thematic summaries of AERA papers on the part of a large 
segment of ERIC/TM users, we decided to invite a group of authors to assist us in 
producing such a series based on the materials processed for RIE. Four topics were 
chosen for the series: Criterion Referenced Measurement, Evaluation, Statistics, and 
Test Construction. 

Most papers referred to in this summary may be obtained in either hard copy or 
microfiche form from: 

ERIC Document Reproduction Service (EDRS) 
P.O. Drawer O
Bethesda, Maryland 20014 

Prices and ordering information for these documents may be found in any current 
issue of Research in Education.
The 54 papers in this area which were presented at the
annual meeting in Chicago, April 1972, were classified
into eleven categories as shown below. The figure in
parentheses is the number of studies in the category.

A. What to measure 

1. Educational objectives (3)

2. Attitude measurement (7) 

3. Creativity (7) 

B. How to measure 

4. Item types (2) 

5. Test development (7) 

6. Response modifications (5) 

7. Confidence weighting (4) 

8. Semantic differential (5) 

9. Observational techniques (3) 

C. Test use 

10. Testing programs (8) 

11. Test bias (3)

1. Educational Objectives

Hoepfner obtained ratings of 106 goal statements from
2,555 individual raters. Self esteem received the highest
rating (4.67 on a five-point scale of importance). Writing
fluency in a foreign language received the lowest rating
(1.48). The author cautions against the assumption of
general validity for his findings, but believes that the
procedure used is worthy of wider application.

A Q-sort was used by Doherty to differentiate the goal
valuations of elementary school teachers, principals and
parents when classified along demographic lines such as
racial-ethnic composition of the student body. He found
that the demographic variables seemed to have little
influence on the priority rating of the goals.

Dyer and others describe a methodologically sophisticated
utility function estimation procedure designed to
provide curriculum planning information to elementary
school principals. The reader who is not already well
versed in utility function methodology may be inclined to
question the authors' claim that it is a "simple procedure."
An elementary school principal may have difficulty
in perceiving its utility for him.

2. Attitude Measurement 

Kilbane reported development via factor analysis of a 30
item inventory designed to tap high school students'
attitudes toward self, toward others and toward social
participation. She reported that the instrument appeared
to differentiate between students who remained in school
and those who dropped out. A similar effort to develop
an instrument to measure the attitudes of seventh grade
pupils toward school learning was reported by Brehman.
The new instrument was shown to have high internal
consistency. No validity data were reported.

Assessment of the quality of education in the public 
schools of Pennsylvania has been a high priority objective 
of the Pennsylvania Department of Education in recent 
years. Brehman's study and the two which follow were
outgrowths of efforts to assess student attitudes toward 
school learning. 

McGuiness and Stanh used the technique of sociometry
to validate a measure of understanding and appreciation
of persons belonging to social, cultural and ethnic groups
different from one's own. The instrument showed high
(.95) internal consistency, moderate ([illegible]) retest reliability
after seven months, and low to moderate validity
coefficients (.28 to .73).

Landis reported a study of the construct validity of a
self esteem inventory. It was predicted that students
making higher scores on a standardized achievement test
battery would also make higher scores on the self esteem
inventory. It was also predicted that scores on the self
esteem inventory would correlate positively with scores on
a self concept as a learner scale. Both predictions were
substantiated.

The influence of a mother's self concept on the
deprived child's self concept has been investigated in
several studies by locco and Bridges. The present report
cites several significant relationships which were
demonstrated by canonical correlation techniques.

Gable and Pruzek compared the results of two methods
of categorizing the items in a scale for measuring
attitudes toward black people. One method applied latent 
partition analysis to classifications of the items by judges. 
The other method applied factor analysis to the responses 
of college students to the same items. The two methods 
showed substantial agreement. The use of this combina- 
tion of approaches provides some validation of the 
attitude constructs being assessed by the instrument. 

In a brief report Shoemaker described a method for
reducing the labor of scaling attitude test items by the
method of paired comparison. Randomly selected subsets
of pairings were given to random samples of the judges.
He concluded that the sampling procedure yielded scale
values satisfactorily approximating those obtained from
the whole population of pairings and of judges. No data
or results of data analysis are included in the report,
however.

3. Creativity 

A 29 item inventory of creative accomplishment was 
administered to 166 college students, along with three 
tests of convergent thinking and five of divergent think- 
ing, in a study reported by Stafford and Browne. The 
data indicated to the authors that creative accomplishment
is related more closely to fluency than it is to
originality, convergent thinking or drive. The data, how- 
ever, were not included in the report. 

A study of the relation of three cognitive styles,
response tempo, response style and response ambiguity, to
creative problem solving was reported by Hyer and
Rookey. Analysis of data resulting from administration of
tests of intelligence, cognitive style and creativity to 288
junior high school students led to the conclusion that
creative problem solving is about equally affected by
intelligence and the cognitive styles.

Using as criteria the ratings by two professors of the
creativity of 34 graduate student writers, Barro found
higher correlations with personality than with cognitive
test scores. The correlations tended to be low, and not
significantly different from zero for a majority of the
possible predictor variables. No reliability coefficients
were reported for any of the variables.

Stallings and Gillmore investigated the relation between 
measures of creativity and grades in courses presumably 
eliciting creative behavior. Measures of creativity were 
obtained from the Torrance Figural Test for over 300 
freshmen in the College of Fine and Applied Arts at the 
University of Illinois. Only one validity coefficient, of the 
68 generated, was significant at the .05 level. The authors 
concluded that scores from this test had little utility in 
enhancing the prediction of grades in these courses, and 
that their data did not support the validity of the test. 

The effects of practice on the nonverbal creativity of
fifth grade children were investigated by Roweton and
Spencer, using forms A and B of Torrance's picture
completion task. They found the overall effects to be less 
marked than in the case of verbal creativity, and more 
dependent on peculiarities of the task item. 

Shigaki also studied the effect of practice (trend of 
scores over time) on originality scores obtained from 56 
protocols resulting from administration of the Torrance 
Tests of Creative Thinking to intermediate grade children. 
She found significant improvement in verbal originality 
test scores but not in the figural test scores. 

Prediction equations have been developed to simulate
by computer the behavior of human judges in rating
creativity test performances. Greene and Zirkel undertook
to determine the stability and usefulness of these
equations when applied to samples drawn from other
populations. They concluded that equations predicting ratings of
fluency and flexibility were stable enough to be useful,
but that those designed to predict ratings of originality
were not.

4. Item Types 

In a study comparing multiple-choice and true-false test 
items, Oosterhof and Glasnapp found lower reliability for
tests composed of true-false items than for tests com- 
posed of multiple choice items even after adjustment for 
differences in time requirements. They also found that 
false versions of an item contributed more to reliability 
than true versions, and that multiple choice items re- 
quired about 1.75 times as much time to answer as 
true-false items. Subjects for their study were 101 
undergraduates enrolled in an introductory measurements 
course. 

In a similar study using 1,018 high school students as
subjects, Frisbie and Eble obtained similar results. Their
subjects required 1.5 times as much time to answer a 
multiple choice item as they required to answer a 
true-false item. In all cases the true-false tests were less 
reliable than the multiple choice tests even after adjust- 
ment for differences in time required. The data from this 
study did not justify rejection of the hypothesis that the 
two types of tests measure the same thing. 
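The summaries above do not say how the adjustment for time was made. One common way to put two item types on an equal-time footing is the Spearman-Brown formula, which projects the reliability of a test lengthened by a given factor. The sketch below assumes that approach; the function name and the reliability value of .60 are hypothetical, and only the time ratio of 1.5 comes from the text.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability of a test lengthened by length_factor
    (Spearman-Brown formula)."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Hypothetical true-false reliability; NOT a value from either study.
tf_reliability = 0.60
# If a multiple choice item takes 1.5 times as long as a true-false item,
# 1.5 times as many true-false items fit into the same testing time.
time_ratio = 1.5

adjusted = spearman_brown(tf_reliability, time_ratio)
print(f"time-adjusted true-false reliability: {adjusted:.3f}")
```

The adjusted true-false value would then be compared with the observed reliability of the multiple choice test given in the same amount of time.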

5. Test Development 

Orpet's report on the development of an experimental
sensory motor and movement skills test battery empha- 
sizes the potential usefulness of such a battery, provides a 
brief rationale for each of the tests included in it, 
describes the standardization procedure, and gives lower 
bound reliability estimates derived from communalities 
from the factor analysis. No data on the score distribu- 
tions from the various tests for children in various grades 
are reported. 

Pandey and Cleary describe the development of a test 
of basic skills for adults enrolled in literacy and other 
remedial programs. The test includes 20 communications 
items, 19 numerical items and 28 items measuring 
practical skills in such things as using a telephone 
directory or filling out forms. The subtests are shown to 
have high reliability and to differentiate seventh, eighth
and ninth grade students clearly. 

Development of a test to measure occupational awareness
was described in a report by Reardon and others.
The items were based on information from the Dictionary
of Occupational Titles with emphasis on worker traits in
different occupations. No sample items are included in
the report. The final form of the test had a KR-20
reliability of .769 in statewide administration to 2,640
pupils in 90 schools in Pennsylvania. (It is interesting to
note that the number of items in the test is reported to
be 30.000!) School means ranged widely from 10.57 to
21.77, indicating that the test will discriminate clearly
among schools.
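For readers unfamiliar with the KR-20 coefficient cited above, the following is a minimal sketch of how it is computed from dichotomously scored (0/1) item responses. The function name and the small response matrix are hypothetical illustrations; they are not data from the Reardon report.

```python
def kr20(responses: list[list[int]]) -> float:
    """Kuder-Richardson formula 20 for a persons-by-items matrix of 0/1 scores.

    Uses the population variance (divisor n) of total scores, which matches
    the p*q form of the item variances.
    """
    n_persons = len(responses)
    n_items = len(responses[0])

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p * (1 - p), where p is the proportion correct.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_persons
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical responses of five examinees to four items.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20 = {kr20(example):.3f}")
```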

Millman discussed several bases for determining the 
passing score on a criterion-referenced test. He also 
described a means for determining the relation between 
the passing score on a test, the number of items in the 
test, and the percent of students wrongly passed and 
failed. A table illustrating these relationships is included 
in the report. 
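The kind of relationship Millman tabulates can be illustrated with a simple binomial model. The sketch below assumes that each item is an independent trial answered correctly with probability equal to the student's true proficiency. That model, the function name, and the proficiency levels of .70 and .90 are assumptions made here for illustration, not details taken from the report.

```python
from math import comb


def prob_pass(true_level: float, n_items: int, passing_score: int) -> float:
    """Probability of answering at least passing_score of n_items correctly,
    assuming independent items each answered correctly with probability
    true_level (a binomial model, assumed for illustration)."""
    return sum(
        comb(n_items, x) * true_level**x * (1.0 - true_level) ** (n_items - x)
        for x in range(passing_score, n_items + 1)
    )


# Hypothetical illustration: passing score set at 80 percent of the items.
for n_items in (5, 10, 20):
    passing_score = (4 * n_items) // 5
    # A student whose true level (.70) is below the standard but who passes.
    wrongly_passed = prob_pass(0.70, n_items, passing_score)
    # A student whose true level (.90) is above the standard but who fails.
    wrongly_failed = 1.0 - prob_pass(0.90, n_items, passing_score)
    print(n_items, passing_score,
          round(wrongly_passed, 3), round(wrongly_failed, 3))
```

Under these assumptions, lengthening the test lowers both error rates for students whose true level is not close to the standard.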

A computer program for textual analysis was used by 
Felsenthal and Felsenthal to obtain data from 20 trade 
books for children for computation of readability indices 
by various formulas. Data on the means, standard deviations
and intercorrelations of the various readability
measures are reported. Analysis of variance showed no
significant difference in the readability of material in the
first, middle or last thirds of the books. No data are
presented on the advantages or disadvantages of computer 
assistance with this task. 

Hofmann, in a long and elaborately mathematical
paper, discussed an efficiency index for use in item
analysis. It is defined as the ratio of the observed
discrimination to the maximum possible discrimination
for an item of that difficulty level. The author shows that
the index lends itself to a variety of interpretations, many
of which can be given as probability statements. He
suggests, but does not demonstrate, that the use of this
index will enable test-makers to build better tests.
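The ratio described above can be illustrated with the familiar upper-lower discrimination index. This sketch assumes equal-sized upper and lower scoring groups and takes discrimination to be the difference in proportion correct between them. That choice, and the function name, are assumptions for illustration, and the index in Hofmann's paper may be defined differently.

```python
def efficiency_index(p_upper: float, p_lower: float) -> float:
    """Observed discrimination divided by the maximum discrimination
    possible for an item of the same difficulty.

    Assumes equal-sized upper and lower groups, so the item difficulty
    is the mean of the two group proportions.
    """
    observed = p_upper - p_lower
    difficulty = (p_upper + p_lower) / 2.0
    # With difficulty fixed, the difference is largest when the lower
    # group proportion is 0 (easy limit) or the upper proportion is 1.
    max_possible = 2.0 * min(difficulty, 1.0 - difficulty)
    if max_possible == 0:
        raise ValueError("item answered correctly by all or by none")
    return observed / max_possible


# Hypothetical item: 80 percent correct in the upper group, 40 in the lower.
print(round(efficiency_index(0.8, 0.4), 3))
```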

In a long report of a complex study Tyler analyzed the
relation of response stability to personality test
homogeneity, and reported data gathered to test the analysis.
The data were obtained from 22 dichotomously scored
personality tests. Persons and items were scaled according
to the Rasch model. Tyler found that response instability
was a joint function of subject and item scale values.
While heterogeneous tests provide the best predictions in
practical applications, homogeneous tests have more to
contribute in the development of personality theory.

6. Response Modifications

Koehler reviewed the rationale for and experimental
assessments of several modifications of conventional best
answer response to multiple choice test items. One of
these asks the examinee to cross out every response he
knows to be incorrect. Another asks him to mark as
many as necessary to be sure of including the correct
response. Koehler concluded that the experimental data
provide substantial evidence in favor of the continued use
of the conventional one-best-answer response to multiple
choice test items.

Reilly and Jackson reported that empirical response 
weighting of the multiple choice items in the aptitude test 
of the Graduate Record Examination substantially
increased test reliability but did not increase test validity.
They suggested that the empirical weighting capitalized on 
the tendency to omit, and that while this tendency is 
reliable, it is not valid. 

Scores obtained from exact and approximate Guttman 
weights were compared by Green with several other 
weighted and unweighted scores for 2,500 men on the 
verbal portion of the Scholastic Aptitude Test. He 
concluded that weighted scoring is not to be
recommended in this situation.

Hendrickson and Green studied the effect of Guttman
weighted scoring on the factor structure of subtests of the
Scholastic Aptitude Test, using rights-only scoring as the
basis for comparison. They found significant differences
in the factor structure of the two types of scores and
concluded that the two scores measure different
functions. This helps to explain why Guttman weighting, which
increases the reliability of test scores, often reduces their
validity.

Baker showed that the item response weighting tech- 
nique known as the method of reciprocal averages is a 
particular implementation of Guttman's general model for
internal consistency scaling. He argued that the reciprocal 
averages method is computationally simpler than the 
Guttman method, and that it can be implemented by a 
simple extension of existing item analysis computer 
programs. 



7. Confidence Weighting

Twenty-four graduate education majors enrolled in a
course on measurement and evaluation took a 20 item
test on which they had the option of either confidence
weighted response or Coombs-type multiple response.
Garvin, who conducted and reported the study, found
that the weighted scoring procedures separately or in
combination invariably depressed the reliability
coefficient.

A second paper by Garvin presented a comprehensive
discussion of confidence weighting: its nature, purpose,
varieties and effects. It is his opinion that confidence 
weighting procedures have a kind of intrinsic validity, for 
knowing how much confidence to place in a belief is an 
important aspect of a person's knowledge. 

In two closely related papers, Rippey discussed the 
rationale and development of confidence testing. He made 
a case for the use of "intrinsic items" which do not have 
unique correct responses and require the examinee to 
distribute his belief over the options. Several mathematical
functions that might be used in scoring confidence
weighted responses were suggested. He concluded that
effective use of confidence testing will require the 
solution of a number of psychological and educational 
problems. 

8. The Semantic Differential 

Raper and Wasik used the semantic differential technique
to differentiate pre-school educational environments. They
found that perceptions of the pre-school environment
were related to the prior experience of the perceiver, and
to the conditions under which the semantic differential
data were gathered.

Two hundred sixteen elementary school children rated 
the concept "myself" on semantic differential scales
defined by 55 adjective pairs in a study reported by 
Lynch and Cochran. Ratings of the 55 scales were factor 
analyzed overall and by grade. Three factors were extracted 
overall, but six were found in grade two and seven in each 
of grades four and six. This is a more complex judgmental 
structure than has previously been reported for grade 
school children. 

The semantic differential technique was used by 
Francies to measure the attitudes of sixth grade pupils 
toward a course in Family Life education. Drawings 
depicting family situations were displayed to the pupils 
by means of an overhead projector. The investigator 
concluded that this technique of attitude measurement 
merits wider use. 

Gulo summarized the results of four recent studies
involving use of the semantic differential to measure
teaching effectiveness. The factors revealed in these four
studies differ somewhat. Those that reappear tend to
account for different proportions of the total variance. 
Nevertheless the author regards the technique as especially 
useful in quantifying student perceptions of effective 
teaching. 

In an essentially methodological paper, Lynch urged
more frequent use of the D statistic (generalized distance
function) as a basis for comparing profiles on semantic
differential data. He suggested that this approach would
facilitate interpretable results on meaningful variables with
both efficiency and theoretical utility.
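In semantic differential work the D statistic is commonly defined as the square root of the sum of squared differences between two profiles across the scales, that is, the Euclidean distance between them. The sketch below assumes that definition. The function name and the two profiles of ratings are hypothetical and are not taken from Lynch's paper.

```python
from math import sqrt


def d_statistic(profile_a: list[float], profile_b: list[float]) -> float:
    """Distance between two semantic differential profiles, computed as
    the square root of the summed squared scale-by-scale differences."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must be rated on the same scales")
    return sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical mean ratings of two concepts on four seven-point scales.
concept_one = [6.0, 5.0, 2.0, 4.0]
concept_two = [5.0, 3.0, 4.0, 4.0]
print(d_statistic(concept_one, concept_two))
```

A smaller D indicates that the two concepts, or the two raters, occupy nearby positions in the semantic space defined by the scales.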

9. Observational Techniques 

Cunningham and Boger described the Parent-Child Inter- 
action Rating Procedure. Video-taped sessions in which 
the parent teaches the child to perform a simple sorting 
task provide a record of the interactions which are rated. 
Among the aspects of interaction that are rated are voice 
tone, task orientation, reward and cues. The authors of 
the report see wide usefulness of the procedure in the
study of the teaching-learning of young children, and the
development of their behavior patterns.

The construction and validation of a theoretically 
based system for the analysis of teaching roles in
childhood education was reported by Southwell and
Webb. Four teaching roles (acquisition, inquiry,
mother-surrogate, and authenticity) were defined, and teaching
behaviors presumably characteristic of each role were
identified. The investigators found that teachers in
different schools and in different educational programs
exhibited characteristically different behaviors.

The rod and frame test was administered twice to 70
children ([ages illegible]), along with an intelligence
test. Significant differences were found for age groups and
sex groups, but not between first and second test
administrations. Reliability coefficients ranged from .43 to
.72 for different age groups and with different intervals
between testing. Correlation of rod and frame test scores
with intelligence test scores was low. The investigators,
Busch and Simon, conclude that further definition of the
construct will be necessary.

10. Testing Programs 

Seven papers in this group were presented as part of a
symposium on "The Madison Plan: A New Approach to
System-Wide Testing." Mathews stated the purpose of the
symposium and described its structure. The symposium, 
he said, takes a long hard look at a school district testing 
program, finds it inadequate and dysfunctional, proposes 
an alternative structure, and discusses the results from, 
and problems with, attempts to implement this structure. 

Cleary and Mathews described how dissatisfaction with
the existing program, which involved massive testing with 
minimum use of the results, led to organization of the 
Nucleus Testing Committee. This committee representing 
all schools in the system was charged with becoming 
knowledgeable about tests and determining what kind of
data the schools needed about children. 

Seeman reported results of a survey of the evaluation 
concerns of various members of the school staff. The 
survey revealed continuing concern for the capacity/ 
achievement dimension, and some discordant priorities of 
various staff groups and staff members. 

How and why the testing program of the Madison 
Public Schools was reduced to reading in grades 1, 2, 3, 4,
5, and 8, and mathematics in grade 5 was described by 
Nettleton. She noted that when more specific criterion- 
referenced tests are adopted, testing every pupil will be 
replaced by random sampling to provide the same 
normative data for less time and cost. 

Christiansen described the work of the Curriculum- 
Related Subcommittee, its accomplishments and plans for 
further development of curriculum-related tests. The 
subcommittee is emphasizing objective-based instruction, 
criterion-referenced testing and program-fair assessment.

One facet of the "Madison Plan" for system wide 
testing called for exploration of testing in the affective 
domain. A report by Hansen describes what the affective 
subcommittee did during 18 months to acquire an 
understanding of the affective domain and to develop 
criteria for an affective testing program. Creation of the 
tests remains a task for the future which, in the view of 
the committee, is brighter.

Presenting an administration view of the "Madison Plan", 
Sapone (Director of Curriculum Development in Madison 
Public Schools) rejects the assumptions of norm refer- 
enced testing, deplores its effects on teaching and learn- 
ing, and suggests that it ought to disappear quickly. He 
believes that the new criterion referenced testing program 
being developed in Madison should begin to pay dividends 
within the next few years not only to Madison but to the 
whole nation. 

In another paper Mathews described computer-generated
verbal reports of test scores for parents and teachers.
These reports supplement numerical reports and eliminate 
the need for tables of percentiles or grade equivalent 
scores. Teachers tended to rate the verbal reports higher
than the conventional reports in clarity, usefulness, 
meaning, value, sufficiency and accuracy. 

11. Test Bias 

Do eighth grade students from minority groups perform
differently on tests of academic aptitude and achievement
when they know their scores will be compared with
scores of (1) other minority students or (2) majority
group students? The answer from a study by Oakland and
Emmer is negative. However, the nature of the comparison
group did have some effect on their expectations of
performance level.

Is the performance of a first, second or third grade
pupil on an individual test of intelligence affected by the 
race of the examiner? Savage and Bowers found the 
answer to be yes in their study. Pending further study of 
the complex interactions they recommend that tester and 
student tested be of the same race. 

Green investigated the possibility that an elementary 
school achievement test might be biased against minority
group members. In his view a test is biased against a 
particular group if it contains a substantial proportion of 
items that would not have been selected if the item 
tryout had been made in that particular group. He found 
substantial evidence for bias of that kind, and suggested 
that producers of standardized tests should show more
concern for eliminating it. 



12. Concluding Remarks 

On the whole these reports reflect competently executed 
research studies. In a few cases essential data that should 
have been available to the researcher were not given in 
the report, and in some cases the reader is left in doubt 
concerning what particular questions the researcher was 
trying to answer and what answers he thought he had 
found. Critical readers, basing reactions on their own 
special interests and perceptions of truth, are likely to see 
shortcomings in many of the studies reported. But they 
are also likely to learn something of value from most of
them. 